Finding remote protein homologs with hidden Markov models

نویسنده

  • Kim Laurio
چکیده

Detecting remote homologs by sequence similarity gets increasingly difficult as the percentage of identical residues decreases. The aim of this work was to investigate if the performance of hidden Markov models could be improved by ignoring the subsequences that exhibit high variability, and only concentrate on the truly conserved regions. This is based on the underlying assumption that these high variability regions could be unnecessary, or even misleading, during search of remote protein homologs. In this paper we challenge this assumption by identifying the high and low variability regions of multiple alignments and modifying models by focusing them on the conserved regions. The high variability regions are located with information theoretic measures and modeled by free insertion modules, which are special nodes that can be used to model arbitrarily long subsequences with a uniform probability distribution. The results do not support a definitive conclusion since a few cases exhibit a performance increase, while the general trend is that the performance decreases when ignoring high variability regions. Two supplementary tests suggest that when there is a significant performance loss due to deletion of high variability nodes, a much smaller decrease occurs when the nodes are preserved but the position-specific amino acid distributions are removed. Taken together, these results support the hypothesis that there is some valuable information present in the high variability regions that enable the model to better discriminate between true and false homologs; and that other constructs for the high variability regions could perform better.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Hidden Markov models for detecting remote protein homologies

MOTIVATION A new hidden Markov model method (SAM-T98) for finding remote homologs of protein sequences is described and evaluated. The method begins with a single target sequence and iteratively builds a hidden Markov model (HMM) from the sequence and homologs found using the HMM for database search. SAM-T98 is also used to construct model libraries automatically from sequences in structural da...

متن کامل

Detecting distant homologs using phylogenetic tree-based HMMs.

It is often desired to identify further homologs of a family of biological sequences from the ever-growing sequence databases. Profile hidden Markov models excel at capturing the common statistical features of a group of biological sequences. With these common features, we can search the biological database and find new homologous sequences. Most general profile hidden Markov model methods, how...

متن کامل

Augmented training of hidden Markov models to recognize remote homologs via simulated evolution

MOTIVATION While profile hidden Markov models (HMMs) are successful and powerful methods to recognize homologous proteins, they can break down when homology becomes too distant due to lack of sufficient training data. We show that we can improve the performance of HMMs in this domain by using a simple simulated model of evolution to create an augmented training set. RESULTS We show, in two di...

متن کامل

Homology Detection via Family Pairwise Search

The function of an unknown biological sequence can often be accurately inferred by identifying sequences homologous to the original sequence. Given a query set of known homologs, there exist at least three general classes of techniques for finding additional homologs: pairwise sequence comparisons, motif analysis, and hidden Markov modeling. Pairwise sequence comparisons are typically employed ...

متن کامل

Assessing strategies for improved superfamily recognition.

There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of pub...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1997